1 Climate change and temperature anomalies

To define temperature anomalies we need a reference, or base, period; NASA uses the period 1951-1980.
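As a quick illustration with made-up numbers (not NASA's data), an anomaly is simply an observed temperature minus the mean over the base period:

```r
# Toy example with hypothetical values: an anomaly is an observation
# minus the base-period mean
base_period <- c(13.9, 14.0, 14.1)  # hypothetical base-period temperatures
observed    <- c(14.1, 14.3, 14.8)  # hypothetical recent temperatures
observed - mean(base_period)        # anomalies: 0.1, 0.3, 0.8
```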

weather <- 
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv", 
           skip = 1, 
           na = "***")

In the code below, we select the year and the twelve month variables from the weather dataset and convert the dataframe from wide to ‘long’ format.

tidyweather <- weather %>% 
  select(1:13) %>% 
  pivot_longer(cols = 2:13,
               names_to = "Month",
               values_to = "delta")

1.1 Plotting Information

Let us plot the data using a time-series scatter plot, and add a trendline.

tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), Month, "1")),
         month = month(date, label=TRUE),
         year = year(date),
         Month = factor(Month, 
                       levels = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"))
  )

ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point()+
  geom_smooth(color="red") +
  theme_bw() +
  labs (
    title = "Weather Anomalies",
    caption = "Figure 1.1"
  )+
   theme(plot.caption = element_text(hjust = 0))

Is the effect of increasing temperature more pronounced in some months?

We now plot the temperature deviation separately for each month, as shown below in Figure 1.2. In recent years, there seems to be a growing trend towards higher temperature deviations.

ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point()+
  geom_smooth(color="red") +
  facet_wrap('Month')+
  theme_bw() +
  labs (
    title = "Weather Anomalies",
    caption = "Figure 1.2"
  )+
  theme(plot.caption = element_text(hjust = 0))

comparison <- tidyweather %>% 
  filter(Year>= 1881) %>%     #remove years prior to 1881
  #create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))

Inspect the comparison dataframe by clicking on it in the Environment pane.

1.1.1 Density plot to study the distribution of monthly deviations (delta), grouped by the different time periods.

ggplot(comparison, aes(x=delta, fill=interval))+
  geom_density(alpha=0.2) +   #density plot with transparency set to 20%
  theme_bw() +                #theme
  labs (
    title = "Density Plot for Monthly Temperature Anomalies",
    y     = "Density"    ,     
    caption="Figure 1.3"
  )+
  theme(plot.caption = element_text(hjust = 0))

1.1.2 Scatterplot for Average Annual Anomalies:

#creating yearly averages
average_annual_anomaly <- tidyweather %>% 
  group_by(Year) %>%
  summarise(annual_average_delta = mean(delta,na.rm=TRUE))

#plotting the data:
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
  geom_point()+
  
  #Fit the best fit line, using LOESS method
  geom_smooth() +
  theme_bw() +
  labs (
    title = "Average Yearly Anomaly",
    y     = "Average Annual Delta",
    caption = "Figure 1.4"
  ) +
  theme(plot.caption = element_text(hjust = 0))

1.2 Confidence Interval for delta

NASA points out on their website that

A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.

We construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation.

formula_ci <- comparison %>%
  filter(interval == "2011-present") %>%
  summarise(
    min_delta =  min(delta,na.rm=TRUE),
    max_delta =  max(delta,na.rm=TRUE),
    median_delta =  median(delta,na.rm=TRUE),
    mean_delta = mean(delta,na.rm=TRUE),
    sd_delta = sd(delta,na.rm=TRUE),
    count = n(),
    se_delta = sd_delta/ sqrt(count),
    t_critical = qt(0.975, count - 1 ),
    lower = mean_delta - t_critical * se_delta,
    upper = mean_delta + t_critical * se_delta
)


formula_ci %>% kbl(caption = "Table 1.1: Confidence Interval using Formula Method ") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 1.1: Confidence Interval using Formula Method
min_delta max_delta median_delta mean_delta sd_delta count se_delta t_critical lower upper
0.46 1.94 1.04 1.06 0.276 132 0.024 1.98 1.01 1.11
library(tidyverse)
library(janitor)
library(infer)
library(here)
library(skimr)

set.seed(1234)

mean_delta <- 1.06
sd <- 0.276
n <- 132  # number of monthly observations since 2011, matching `count` in Table 1.1
se <- sd / sqrt(n)
lower95 <- mean_delta - qt(0.975,n-1) * se
upper95 <- mean_delta + qt(0.975,n-1) * se

bootstrap_comp <- comparison %>%
  filter(interval == "2011-present") %>%  # filter() has no na.rm argument
  specify(response = delta) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

percentile_ci <-bootstrap_comp %>%
  get_ci(level = 0.95, type = "percentile")

bootstrap_plot <- ggplot(bootstrap_comp, aes(x = stat)) +
  geom_histogram() +
  labs(title = "Bootstrap distribution of means",
       x = "Mean delta per bootstrap sample",
       y = "Count") +
  geom_vline(xintercept = percentile_ci$lower_ci, colour = "orange", size = 2, linetype = 2) +
  geom_vline(xintercept = percentile_ci$upper_ci, colour = "orange", size = 2, linetype = 2)

visualize(bootstrap_comp) +
  shade_ci(endpoints = percentile_ci, fill = "khaki") +
  geom_vline(xintercept = lower95, colour = "red") +
  geom_vline(xintercept = upper95, colour = "red")

library(patchwork)
bootstrap_plot

percentile_ci
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.01     1.11

What is the data showing us?

Plotting the confidence intervals from both the formula and the infer method, as shown above, we see that they are very close to each other, which suggests the bootstrap simulation gives a very good estimate of the population mean delta.

2 General Social Survey (GSS)

gss <- read_csv(here::here("data", "smallgss2016.csv"), 
                na = c("", "Don't know",
                       "No answer", "Not applicable"))

2.1 Instagram and Snapchat, by sex

Can we estimate the population proportion of Snapchat or Instagram users in 2016?

2.1.1 Create a new variable, snap_insta that is Yes if the respondent reported using any of Snapchat (snapchat) or Instagram (instagrm), and No if not. If the recorded value was NA for both of these questions, the value in your new variable should also be NA.

snap_insta <- gss %>% 
  pivot_longer(cols = 3:4, # columns 3 and 4: snapchat and instagrm
               names_to = "Snap/Insta",
               values_to = "Recorded Answer")
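An alternative sketch (on toy data, since the combined variable is not coded explicitly above) that builds snap_insta as a single Yes/No/NA variable with case_when(), as the task describes:

```r
library(dplyr)

# Toy rows covering the possible answer combinations
toy <- tibble(snapchat = c("Yes", "No", NA,    NA),
              instagrm = c("No",  "No", "Yes", NA))

toy %>%
  mutate(snap_insta = case_when(
    snapchat == "Yes" | instagrm == "Yes" ~ "Yes",  # uses at least one
    snapchat == "No"  & instagrm == "No"  ~ "No",   # uses neither
    TRUE                                  ~ NA_character_  # otherwise missing
  ))
```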

2.1.2 Calculate the proportion of Yes’s for snap_insta among those who answered the question, i.e. excluding NAs.

We now display the proportions of people who answered "Yes" and "No":

snap_insta2 <- snap_insta %>% 
  filter(`Recorded Answer` %in% c("Yes","No")) # == would recycle the vector; %in% keeps every Yes/No row

prop.table(table(snap_insta2$`Recorded Answer`))
## 
##    No   Yes 
## 0.753 0.247

2.1.3 Using the CI formula for proportions, we now construct 95% CIs for men and women who used either Snapchat or Instagram

table(snap_insta2$`Recorded Answer`)
## 
##  No Yes 
## 949 311
949+311
## [1] 1260
prop.test(311,1260)
## 
##  1-sample proportions test with continuity correction
## 
## data:  311 out of 1260
## X-squared = 322, df = 1, p-value <2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.223 0.272
## sample estimates:
##     p 
## 0.247
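The same interval can be checked by hand with the normal-approximation formula p ± z*sqrt(p(1-p)/n); prop.test() adds a continuity correction, which is why its bounds differ slightly:

```r
x <- 311; n <- 1260
p_hat <- x / n                              # 0.247, as reported above
se <- sqrt(p_hat * (1 - p_hat) / n)         # standard error of a proportion
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se  # 95% CI
round(ci, 3)                                # 0.223 0.271
```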

2.2 Twitter, by education level

2.2.1 We first turn degree from a character variable into a factor variable.

gss <- gss %>% 
  mutate(
    degree = factor(degree,
                    # levels must match the data's spelling exactly;
                    # non-matching values are silently converted to NA
                    levels = c("Lt high school", "High School", "Junior college", "Bachelor", "Graduate")))
skimr::skim(gss)
Data summary
Name gss
Number of rows 2867
Number of columns 7
_______________________
Column type frequency:
character 6
factor 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
emailmin 0 1 1 2 0 15 0
emailhr 0 1 1 3 0 42 0
snapchat 0 1 2 3 0 3 0
instagrm 0 1 2 3 0 3 0
twitter 0 1 2 3 0 3 0
sex 0 1 4 6 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
degree 1469 0.49 FALSE 4 Bac: 536, Lt : 328, Gra: 318, Jun: 216

2.2.2 We now create a new variable bachelor_graduate that is Yes if the respondent has either a Bachelor or Graduate degree. As before, if the recorded value for either was NA, the value in your new variable will also be NA.

gss <- gss %>% 
  mutate(bachelor_graduate = case_when(degree %in% c("Bachelor","Graduate")~"Yes",
         degree %in% c("Lt high school", "High School", "Junior college")~"No" ))

2.2.3 We now calculate the proportion of bachelor_graduate who do (Yes) and who don’t (No) use twitter.

prop.table(table(gss$bachelor_graduate))
## 
##    No   Yes 
## 0.389 0.611
#Longer code but this works too
# t1 <- table(gss$bachelor_graduate)["Yes"]
# t2 <- table(gss$bachelor_graduate)["No"]
# t1/(t1+t2)
# t2/(t1+t2)

2.2.4 Using the CI formula for proportions, we now construct two 95% CIs for bachelor_graduate, for those who do (Yes) and those who don't (No) use twitter.

table(gss$bachelor_graduate)
## 
##  No Yes 
## 544 854
544+854
## [1] 1398
prop.test(854,1398)
## 
##  1-sample proportions test with continuity correction
## 
## data:  854 out of 1398
## X-squared = 68, df = 1, p-value <2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.585 0.636
## sample estimates:
##     p 
## 0.611
prop.test(544,1398)
## 
##  1-sample proportions test with continuity correction
## 
## data:  544 out of 1398
## X-squared = 68, df = 1, p-value <2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.364 0.415
## sample estimates:
##     p 
## 0.389
  1. Do these two Confidence Intervals overlap?

No, the two intervals do not overlap: the two proportions are complements of each other (p and 1 − p), so the Yes interval, [0.585, 0.636], sits well above the No interval, [0.364, 0.415].
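A quick base-R check of the non-overlap, using the plain normal-approximation formula (without prop.test()'s continuity correction):

```r
# 95% normal-approximation CI for a sample proportion
ci_prop <- function(x, n, level = 0.95) {
  p <- x / n
  z <- qnorm(1 - (1 - level) / 2)
  p + c(-1, 1) * z * sqrt(p * (1 - p) / n)
}

ci_yes <- ci_prop(854, 1398)
ci_no  <- ci_prop(544, 1398)
ci_no[2] < ci_yes[1]  # TRUE: the No interval ends before the Yes interval begins
```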

2.3 Email usage

2.3.1 We now create a new variable called email that combines emailhr and emailmin to report the number of minutes the respondents spend on email weekly.

gss2 <- gss %>% 
  filter(emailmin != "NA") %>%   # missing values here are stored as the literal string "NA"
  mutate(emailhr = as.numeric(emailhr),
         emailmin = as.numeric(emailmin),
         email = emailhr * 60 + emailmin)

2.3.2 We now visualise the distribution of this new variable. Is the mean or the median a better measure of the typical amount of time Americans spend on email weekly? Why?

The median is the better measure because it is robust to extreme values. Time spent ranges from 0 to 6,000 minutes and the distribution is heavily right-skewed (a mean of 417 minutes versus a median of 120), so the median better describes the typical respondent.
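A tiny made-up example of why the median resists outliers:

```r
# Hypothetical weekly email minutes for five respondents, one extreme user
minutes <- c(0, 30, 60, 120, 6000)
mean(minutes)    # 1242: pulled up by the single heavy user
median(minutes)  # 60: unaffected by the outlier
```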

g1 <- gss2 %>% 
  summarise(median_email = median(email),
            mean_email = mean(email),
            min_time = min(email),
            max_time = max(email),
            sd_time = sd(email))
colnames(g1) <- c("Median Time Spent on Emails (mins)","Average Time Spent on Emails (mins)", "Minimum Time Spent on Emails (mins)","Maximum Time Spent on Emails (mins)","SD Time Spent on Emails (mins)")
g1 %>% kbl(caption = "Table 2.1 ") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 2.1
Median Time Spent on Emails (mins) Average Time Spent on Emails (mins) Minimum Time Spent on Emails (mins) Maximum Time Spent on Emails (mins) SD Time Spent on Emails (mins)
120 417 0 6000 680
gss2 %>%
  ggplot(aes(x = email)) +
  geom_histogram(bins = 50) +
  theme_bw() +
  labs(
    title = "Distribution of time spent on email",
    x = "Time Spent on Emails"
  ) +
  theme(axis.text = element_text(size = 5)) +
  scale_x_log10()

2.3.3 Using the infer package, we now calculate a 95% bootstrap confidence interval for the mean amount of time Americans spend on email weekly.

The 95% bootstrap CI has been calculated and put in table 2.2 below after converting it into the hours and minute format using the lubridate package.

set.seed(1234)
skim(gss2)
Data summary
Name gss2
Number of rows 1649
Number of columns 9
_______________________
Column type frequency:
character 5
factor 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
snapchat 0 1.0 2 3 0 3 0
instagrm 0 1.0 2 3 0 3 0
twitter 0 1.0 2 3 0 3 0
sex 0 1.0 4 6 0 2 0
bachelor_graduate 828 0.5 2 3 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
degree 828 0.5 FALSE 4 Bac: 341, Gra: 197, Jun: 142, Lt : 141

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
emailmin 0 1 3.32 9.26 0 0 0 0 59 ▇▁▁▁▁
emailhr 0 1 6.89 11.37 0 0 2 8 100 ▇▁▁▁▁
email 0 1 416.84 680.15 0 50 120 480 6000 ▇▁▁▁▁
mean_email <- 417
sd <- 680
n <- 1649
se <- sd / sqrt(n)
lower95 <- mean_email - qt(0.975,n-1) * se
upper95 <- mean_email + qt(0.975,n-1) * se

bootstrap_comp <- gss2 %>%
  specify(response = email) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

percentile_ci <-bootstrap_comp %>%
  get_ci(level = 0.95, type = "percentile")

email_plot <- ggplot(bootstrap_comp, aes(x = stat)) +  # new name, so the numeric mean_email above is not overwritten
  geom_histogram() +
  labs(title= "Bootstrap distribution of mean time spent on email",
       x = "Mean time spent on email per bootstrap sample",
       y = "Count") +
  geom_vline(xintercept = percentile_ci$lower_ci, colour = 'orange', size = 2, linetype = 2) +
  geom_vline(xintercept = percentile_ci$upper_ci, colour = 'orange', size = 2, linetype = 2)

visualize(bootstrap_comp) + 
  shade_ci(endpoints = percentile_ci,fill = "khaki")+
  geom_vline(xintercept = lower95, colour = "red")+
  geom_vline(xintercept = upper95, colour = "red")+
  labs(x= "Mean Amount of time spent on emails (mins)")

library(lubridate)
seconds_to_period(percentile_ci*60)
## [1] "6H 25M 9.95239539114118S" "7H 32M 43.6036992116242S"
p1 <- percentile_ci %>% 
  mutate(lower_ci = seconds_to_period(lower_ci*60),
         upper_ci= seconds_to_period(upper_ci*60))

colnames(p1) <- c("Lower Confidence Interval","Upper Confidence Interval")

p1 %>% kbl(caption = "Table 2.2 ") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 2.2
Lower Confidence Interval Upper Confidence Interval
6H 25M 9.95239539114118S 7H 32M 43.6036992116242S

We would expect the 99% confidence interval to be wider than the 95% confidence interval: to be 99% rather than 95% confident of capturing the true population mean, the interval must use a larger critical value and therefore extend further around the same point estimate.
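The widening comes directly from the larger critical value. With the same n = 1649 as above:

```r
n <- 1649
qt(0.975, n - 1)  # multiplier for a 95% CI, about 1.96
qt(0.995, n - 1)  # multiplier for a 99% CI, about 2.58 -- a wider interval
```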

3 Biden’s Approval Margins

As we saw in class, fivethirtyeight.com has detailed data on all polls that track the president's approval ratings.

# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv') 

glimpse(approval_polllist)
## Rows: 1,597
## Columns: 22
## $ president           <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate           <chr> "9/10/2021", "9/10/2021", "9/10/2021", "9/10/2021"…
## $ startdate           <chr> "1/24/2021", "1/24/2021", "1/25/2021", "1/25/2021"…
## $ enddate             <chr> "1/26/2021", "1/27/2021", "1/27/2021", "1/26/2021"…
## $ pollster            <chr> "Rasmussen Reports/Pulse Opinion Research", "Maris…
## $ grade               <chr> "B", "A", "B", "B", "B", "B", "B", "B+", "B", "B",…
## $ samplesize          <dbl> 1500, 1313, 1500, 2200, 15000, 9212, 1500, 906, 15…
## $ population          <chr> "lv", "a", "lv", "a", "a", "a", "lv", "a", "a", "a…
## $ weight              <dbl> 0.3225, 2.1893, 0.3025, 0.1205, 0.2739, 0.1520, 0.…
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve             <dbl> 48.0, 49.0, 49.0, 58.0, 54.0, 55.0, 50.0, 57.0, 54…
## $ disapprove          <dbl> 48.0, 35.0, 48.0, 32.0, 31.0, 31.5, 45.0, 37.0, 32…
## $ adjusted_approve    <dbl> 50.4, 50.0, 51.4, 56.5, 52.5, 53.5, 52.4, 56.3, 52…
## $ adjusted_disapprove <dbl> 41.9, 34.6, 41.9, 35.4, 34.4, 34.9, 38.9, 36.4, 35…
## $ multiversions       <chr> NA, NA, NA, NA, NA, "*", NA, NA, NA, NA, NA, NA, N…
## $ tracking            <lgl> TRUE, NA, TRUE, NA, TRUE, TRUE, TRUE, NA, TRUE, NA…
## $ url                 <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id             <dbl> 74261, 74320, 74268, 74346, 74277, 74292, 74290, 7…
## $ question_id         <dbl> 139433, 139558, 139483, 139653, 139497, 139518, 13…
## $ createddate         <chr> "1/27/2021", "2/1/2021", "1/28/2021", "2/5/2021", …
## $ timestamp           <chr> "18:35:08 10 Sep 2021", "18:35:08 10 Sep 2021", "1…
# Use `lubridate` to fix dates, as they are given as characters
clean_approval_polllist <- approval_polllist %>% mutate(
  startdate = mdy(startdate),   # inside mutate(), refer to columns directly
  enddate = mdy(enddate)
)

3.1 Create a plot

We calculate the average net approval rate (approve - disapprove) for each week since he got into office, and plot the net approval along with its 95% confidence interval.

The plot should look like this:

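A sketch of how this weekly net-approval plot could be produced (the exact code for the original figure is not shown; this assumes the clean_approval_polllist data frame created above and lubridate's week()):

```r
library(tidyverse)
library(lubridate)

weekly_approval <- clean_approval_polllist %>%
  filter(subgroup == "All polls") %>%
  mutate(week = week(enddate),               # week of the year the poll ended
         net_approval = approve - disapprove) %>%
  group_by(week) %>%
  summarise(mean_net = mean(net_approval),
            sd_net = sd(net_approval),
            n = n(),
            se = sd_net / sqrt(n),
            t_crit = qt(0.975, n - 1),
            lower = mean_net - t_crit * se,
            upper = mean_net + t_crit * se)

ggplot(weekly_approval, aes(x = week, y = mean_net)) +
  geom_point() +
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  theme_bw() +
  labs(title = "Estimating net approval (approve - disapprove) for Joe Biden",
       x = "Week of the year", y = "Average net approval (%)")
```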

3.2 Compare Confidence Intervals

The confidence interval for week 4 is [5.28, 23.60], compared with [9.79, 13.47] for week 25. The widths differ markedly: 18.32 for week 4 versus 3.68 for week 25. This is mostly due to the difference in sample size, as there were only 8 polls in week 4 but 28 polls in week 25, and the standard error shrinks with the square root of the number of polls.
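Holding the spread fixed, a t-interval's width shrinks roughly with the square root of the number of polls. A toy calculation (assuming, for illustration only, the same standard deviation in both weeks):

```r
# Width of a 95% t-interval for a mean: 2 * t_crit * s / sqrt(n)
ci_width <- function(n, s) 2 * qt(0.975, n - 1) * s / sqrt(n)

s <- 10             # hypothetical common standard deviation of net approval
ci_width(8, s)      # week-4-sized sample: wide
ci_width(28, s)     # week-25-sized sample: much narrower
```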

4 Gapminder revisited

library(gapminder)
skim(gapminder)
Data summary
Name gapminder
Number of rows 1704
Number of columns 6
_______________________
Column type frequency:
factor 2
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
country 0 1 FALSE 142 Afg: 12, Alb: 12, Alg: 12, Ang: 12
continent 0 1 FALSE 5 Afr: 624, Asi: 396, Eur: 360, Ame: 300

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1.98e+03 1.73e+01 1952.0 1.97e+03 1.98e+03 1.99e+03 2.01e+03 ▇▅▅▅▇
lifeExp 0 1 5.95e+01 1.29e+01 23.6 4.82e+01 6.07e+01 7.08e+01 8.26e+01 ▁▆▇▇▇
pop 0 1 2.96e+07 1.06e+08 60011.0 2.79e+06 7.02e+06 1.96e+07 1.32e+09 ▇▁▁▁▁
gdpPercap 0 1 7.22e+03 9.86e+03 241.2 1.20e+03 3.53e+03 9.33e+03 1.14e+05 ▇▁▁▁▁
# load gapminder HIV data
hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv"))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv"))

# get World bank data using wbstats
indicators <- c("SP.DYN.TFRT.IN","SE.PRM.NENR", "SH.DYN.MORT", "NY.GDP.PCAP.KD")


library(wbstats)

worldbank_data <- wb_data(country="countries_only", #countries only- no aggregates like Latin America, Europe, etc.
                          indicator = indicators, 
                          start_date = 1960, 
                          end_date = 2016)

# get a dataframe of information regarding countries, indicators, sources, regions, indicator topics, lending types, income levels,  from the World Bank API 
countries <-  wbstats::wb_cachelist$countries

You have to join the 3 dataframes (life_expectancy, worldbank_data, and HIV) into one. You may need to tidy your data first and then perform join operations. Think about what type makes the most sense and explain why you chose it.

Firstly, we clean the 3 dataframes.

hiv_tidy <- hiv %>% 
  select(1:34) %>% 
  pivot_longer(cols = 2:34,
               names_to = "year",
               values_to = "hiv") %>% 
  drop_na()

hiv_tidy
## # A tibble: 3,301 × 3
##    country     year    hiv
##    <chr>       <chr> <dbl>
##  1 Afghanistan 2009   0.06
##  2 Afghanistan 2010   0.06
##  3 Afghanistan 2011   0.06
##  4 Algeria     1990   0.06
##  5 Algeria     1991   0.06
##  6 Algeria     1992   0.06
##  7 Algeria     1993   0.06
##  8 Algeria     1994   0.06
##  9 Algeria     1995   0.06
## 10 Algeria     1996   0.06
## # … with 3,291 more rows
skim(hiv_tidy)
Data summary
Name hiv_tidy
Number of rows 3301
Number of columns 3
_______________________
Column type frequency:
character 2
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 24 0 154 0
year 0 1 4 4 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hiv 0 1 1.74 4.09 0.01 0.1 0.3 1.2 26.5 ▇▁▁▁▁
life_expectancy_tidy <- life_expectancy %>%
  select(1:302) %>%
  pivot_longer(2:302,
               names_to = "year",
               values_to = "life_expectancy") %>%
  drop_na()
life_expectancy_tidy
## # A tibble: 55,528 × 3
##    country     year  life_expectancy
##    <chr>       <chr>           <dbl>
##  1 Afghanistan 1800             28.2
##  2 Afghanistan 1801             28.2
##  3 Afghanistan 1802             28.2
##  4 Afghanistan 1803             28.2
##  5 Afghanistan 1804             28.2
##  6 Afghanistan 1805             28.2
##  7 Afghanistan 1806             28.1
##  8 Afghanistan 1807             28.1
##  9 Afghanistan 1808             28.1
## 10 Afghanistan 1809             28.1
## # … with 55,518 more rows
worldbank_data_tidy <- worldbank_data %>%
  mutate(year = as.character(date)) %>%
  select(!date)
worldbank_data_tidy
## # A tibble: 12,369 × 8
##    iso2c iso3c country NY.GDP.PCAP.KD SE.PRM.NENR SH.DYN.MORT SP.DYN.TFRT.IN
##    <chr> <chr> <chr>            <dbl>       <dbl>       <dbl>          <dbl>
##  1 AW    ABW   Aruba               NA          NA          NA           4.82
##  2 AW    ABW   Aruba               NA          NA          NA           4.66
##  3 AW    ABW   Aruba               NA          NA          NA           4.47
##  4 AW    ABW   Aruba               NA          NA          NA           4.27
##  5 AW    ABW   Aruba               NA          NA          NA           4.06
##  6 AW    ABW   Aruba               NA          NA          NA           3.84
##  7 AW    ABW   Aruba               NA          NA          NA           3.62
##  8 AW    ABW   Aruba               NA          NA          NA           3.42
##  9 AW    ABW   Aruba               NA          NA          NA           3.23
## 10 AW    ABW   Aruba               NA          NA          NA           3.05
## # … with 12,359 more rows, and 1 more variable: year <chr>

We join the three dataframes with left joins, starting from the shortest dataframe (hiv_tidy) on the left. This keeps only the country-years for which we have HIV data, so the result has no rows that are missing our key variable; individual indicator columns may still contain NAs.
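The distinction matters because a left join keeps every row of the left table, filling unmatched columns with NA, while an inner join keeps only the rows present in both. A minimal dplyr illustration with toy tables:

```r
library(dplyr)

hiv_toy <- tibble(country = c("A", "B"), hiv = c(0.1, 0.2))
gdp_toy <- tibble(country = c("A", "C"), gdp = c(100, 300))

left_join(hiv_toy, gdp_toy, by = "country")   # 2 rows; gdp is NA for country B
inner_join(hiv_toy, gdp_toy, by = "country")  # 1 row: only country A matches
```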

table_gapminder <- hiv_tidy %>%
  left_join(worldbank_data_tidy, by = c("country", "year")) %>%
  left_join(life_expectancy_tidy, by = c("country", "year")) %>%
  inner_join(countries, by = "country")

table_gapminder
## # A tibble: 3,117 × 25
##    country     year    hiv iso2c iso3c NY.GDP.PCAP.KD SE.PRM.NENR SH.DYN.MORT
##    <chr>       <chr> <dbl> <chr> <chr>          <dbl>       <dbl>       <dbl>
##  1 Afghanistan 2009   0.06 AF    AFG             488.        NA          91.4
##  2 Afghanistan 2010   0.06 AF    AFG             543.        NA          87.6
##  3 Afghanistan 2011   0.06 AF    AFG             529.        NA          83.9
##  4 Algeria     1990   0.06 DZ    DZA            3572.        85.9        49.5
##  5 Algeria     1991   0.06 DZ    DZA            3444.        87.6        48.1
##  6 Algeria     1992   0.06 DZ    DZA            3424.        88.7        47  
##  7 Algeria     1993   0.06 DZ    DZA            3279.        88.2        45.9
##  8 Algeria     1994   0.06 DZ    DZA            3183.        88.2        44.8
##  9 Algeria     1995   0.06 DZ    DZA            3241.        88.4        43.6
## 10 Algeria     1996   0.06 DZ    DZA            3315.        87.9        42.5
## # … with 3,107 more rows, and 17 more variables: SP.DYN.TFRT.IN <dbl>,
## #   life_expectancy <dbl>, capital_city <chr>, longitude <dbl>, latitude <dbl>,
## #   region_iso3c <chr>, region_iso2c <chr>, region <chr>,
## #   admin_region_iso3c <chr>, admin_region_iso2c <chr>, admin_region <chr>,
## #   income_level_iso3c <chr>, income_level_iso2c <chr>, income_level <chr>,
## #   lending_type_iso3c <chr>, lending_type_iso2c <chr>, lending_type <chr>
  1. What is the relationship between HIV prevalence and life expectancy?
table_gapminder %>%
  ggplot(aes(x=hiv,y=life_expectancy))+
  geom_point()+
  geom_smooth(method="lm")+
  facet_wrap(~region)+
  labs(
    title= "Relationship between HIV prevalence and life expectancy",
    x= "HIV prevalence (% of population)",
    y ="Life expectancy (years)"
  )

The graphs highlight a negative correlation between HIV prevalence and life expectancy in all regions besides Europe & Central Asia: the higher the HIV prevalence, the lower the life expectancy tends to be. One hypothesis is that in Europe people have better access to HIV treatment and medication, which increases their life expectancy.

  1. What is the relationship between fertility rate and GDP per capita?
table_gapminder %>%
  ggplot(aes(x=NY.GDP.PCAP.KD,y=SP.DYN.TFRT.IN))+
  geom_point()+
  facet_wrap(~region)+
  geom_smooth(method = "lm")+
  labs(
    title="Relationship between fertility rate and GDP per capita",
    y="Fertility rate",
    x= "GDP per capita"
  )

For all regions besides Europe & Central Asia and the Middle East & North Africa, we observe an inverse correlation: the higher the GDP per capita, the lower the fertility rate. This points to a connection between economic conditions and fertility choices, as poorer countries tend to have higher fertility rates.

  1. Which regions have the most observations with missing HIV data?
hiv_na <- hiv %>% 
  select(1:34) %>% 
  pivot_longer(cols = 2:34,
               names_to = "year",
               values_to = "hiv")
hiv_na_country <- left_join(hiv_na, countries, by = "country")

hiv_na_country %>%
  filter(!is.na(region)) %>%
  group_by(region) %>%
  summarise(na=sum(is.na(hiv)))%>%
  mutate(region=fct_reorder(region, na))%>%
  ggplot(aes(x=na,y=region))+
  geom_col()+
  labs(
    title="Number of missing HIV observations per region",
    y="",
    x= "Number of NA values"
  )

Sub-Saharan Africa and Europe & Central Asia have the highest number of NA values.

  1. How has mortality rate for under 5 changed by region?
endpoints <- table_gapminder %>%
  group_by(country)%>%
  summarise(min_year=min(year), max_year=max(year))

endpoints
## # A tibble: 143 × 3
##    country     min_year max_year
##    <chr>       <chr>    <chr>   
##  1 Afghanistan 2009     2011    
##  2 Algeria     1990     2008    
##  3 Angola      1979     2011    
##  4 Argentina   1979     2011    
##  5 Armenia     1990     2011    
##  6 Australia   1983     2011    
##  7 Austria     1981     2011    
##  8 Azerbaijan  1990     2011    
##  9 Bangladesh  1990     2011    
## 10 Barbados    1979     2011    
## # … with 133 more rows
join_min <- left_join(table_gapminder, endpoints, by = "country") %>%
  select(country, year, min_year, max_year, SH.DYN.MORT)%>%
  mutate(mortality_origin = ifelse(year == min_year, SH.DYN.MORT, 0))%>%
  select(!year)%>%
  filter(!mortality_origin==0)%>%
  select(!SH.DYN.MORT)

join_min
## # A tibble: 139 × 4
##    country     min_year max_year mortality_origin
##    <chr>       <chr>    <chr>               <dbl>
##  1 Afghanistan 2009     2011                 91.4
##  2 Algeria     1990     2008                 49.5
##  3 Argentina   1979     2011                 46.8
##  4 Armenia     1990     2011                 48.9
##  5 Australia   1983     2011                 11.9
##  6 Austria     1981     2011                 15.5
##  7 Azerbaijan  1990     2011                 95.2
##  8 Bangladesh  1990     2011                144. 
##  9 Barbados    1979     2011                 28.6
## 10 Belarus     1990     2011                 15.2
## # … with 129 more rows
join_max <- left_join(table_gapminder, endpoints, by = "country") %>%
  select(country, year, min_year, max_year, SH.DYN.MORT)%>%
  mutate(mortality_end = ifelse(year == max_year, SH.DYN.MORT, 0))%>%
  select(!year)%>%
  filter(!mortality_end==0)%>%
  select(!SH.DYN.MORT)

join_max
## # A tibble: 143 × 4
##    country     min_year max_year mortality_end
##    <chr>       <chr>    <chr>            <dbl>
##  1 Afghanistan 2009     2011              83.9
##  2 Algeria     1990     2008              29.5
##  3 Angola      1979     2011             112. 
##  4 Argentina   1979     2011              13.9
##  5 Armenia     1990     2011              17.6
##  6 Australia   1983     2011               4.5
##  7 Austria     1981     2011               4.2
##  8 Azerbaijan  1990     2011              35  
##  9 Bangladesh  1990     2011              46.1
## 10 Barbados    1979     2011              14.9
## # … with 133 more rows
join_minmax <- left_join(join_min, join_max, by = "country")
join_minmax
## # A tibble: 139 × 5
##    country     min_year max_year mortality_origin mortality_end
##    <chr>       <chr>    <chr>               <dbl>         <dbl>
##  1 Afghanistan 2009     2011                 91.4          83.9
##  2 Algeria     1990     2008                 49.5          29.5
##  3 Argentina   1979     2011                 46.8          13.9
##  4 Armenia     1990     2011                 48.9          17.6
##  5 Australia   1983     2011                 11.9           4.5
##  6 Austria     1981     2011                 15.5           4.2
##  7 Azerbaijan  1990     2011                 95.2          35  
##  8 Bangladesh  1990     2011                144.           46.1
##  9 Barbados    1979     2011                 28.6          14.9
## 10 Belarus     1990     2011                 15.2           5.1
## # … with 129 more rows
by_region <- left_join(join_minmax, countries, by = "country") %>%
  select(country,region,min_year,max_year,mortality_origin,mortality_end)
by_region
## # A tibble: 139 × 6
##    country     region           min_year max_year mortality_origin mortality_end
##    <chr>       <chr>            <chr>    <chr>               <dbl>         <dbl>
##  1 Afghanistan South Asia       2009     2011                 91.4          83.9
##  2 Algeria     Middle East & N… 1990     2008                 49.5          29.5
##  3 Argentina   Latin America &… 1979     2011                 46.8          13.9
##  4 Armenia     Europe & Centra… 1990     2011                 48.9          17.6
##  5 Australia   East Asia & Pac… 1983     2011                 11.9           4.5
##  6 Austria     Europe & Centra… 1981     2011                 15.5           4.2
##  7 Azerbaijan  Europe & Centra… 1990     2011                 95.2          35  
##  8 Bangladesh  South Asia       1990     2011                144.           46.1
##  9 Barbados    Latin America &… 1979     2011                 28.6          14.9
## 10 Belarus     Europe & Centra… 1990     2011                 15.2           5.1
## # … with 129 more rows
region_list = c("East Asia & Pacific"   ,"Europe & Central Asia"    ,"Latin America & Caribbean","Middle East & North Africa","North America","South Asia","Sub-Saharan Africa")

minmax_data <- left_join(join_minmax, countries, by = "country") %>%
  mutate(min_year = as.numeric(min_year),
         max_year = as.numeric(max_year),
         mortality_origin = as.numeric(mortality_origin),
         mortality_end = as.numeric(mortality_end),
         evolution_of_mortality = (mortality_end - mortality_origin) / mortality_origin) %>%
  arrange(desc(evolution_of_mortality)) %>%
  select(country, region, min_year, max_year, mortality_origin, mortality_end, evolution_of_mortality)

The following tables show, for each region, the five countries with the least improvement in under-5 mortality (first table) and the five with the greatest improvement (second table).

East Asia & Pacific

minmax_data %>%
  filter(region == region_list[1])%>%
  slice_max(order_by = evolution_of_mortality,n = 5)
## # A tibble: 5 × 7
##   country          region              min_year max_year mortality_origin mortality_end
##   <chr>            <chr>                  <dbl>    <dbl>            <dbl>         <dbl>
## 1 China            East Asia & Pacific     2011     2011             14.6          14.6
## 2 Fiji             East Asia & Pacific     1990     2011             28.9          23.6
## 3 Papua New Guinea East Asia & Pacific     1987     2011             89.9          56  
## 4 Philippines      East Asia & Pacific     1990     2011             56.6          31.3
## 5 Japan            East Asia & Pacific     1990     2011              6.3           3.2
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
  filter(region == region_list[1])%>%
  slice_max(order_by = -evolution_of_mortality,n = 5)
## # A tibble: 5 × 7
##   country   region              min_year max_year mortality_origin mortality_end
##   <chr>     <chr>                  <dbl>    <dbl>            <dbl>         <dbl>
## 1 Mongolia  East Asia & Pacific     1990     2011            108.           27.9
## 2 Thailand  East Asia & Pacific     1985     2011             47.8          13  
## 3 Cambodia  East Asia & Pacific     1985     2011            120.           40.6
## 4 Singapore East Asia & Pacific     1990     2011              7.7           2.8
## 5 Australia East Asia & Pacific     1983     2011             11.9           4.5
## # … with 1 more variable: evolution_of_mortality <dbl>
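A side note on the selection: `slice_max(order_by = -evolution_of_mortality, n = 5)` picks the five most negative values, which `slice_min(order_by = evolution_of_mortality, n = 5)` expresses more directly. A toy illustration (made-up numbers, not the gapminder data):

```r
library(dplyr)

toy <- tibble(country = c("A", "B", "C"),
              evolution_of_mortality = c(-0.2, -0.8, -0.5))

# both calls return the same rows: the two biggest improvers, B then C
worst_first <- toy %>% slice_max(order_by = -evolution_of_mortality, n = 2)
same_rows   <- toy %>% slice_min(order_by = evolution_of_mortality, n = 2)
```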

Europe & Central Asia

minmax_data %>%
  filter(region == region_list[2])%>%
  slice_max(order_by = evolution_of_mortality,n = 5)
## # A tibble: 5 × 7
##   country     region                min_year max_year mortality_origin mortality_end
##   <chr>       <chr>                    <dbl>    <dbl>            <dbl>         <dbl>
## 1 Ukraine     Europe & Central Asia     1990     2011             19.4          11.2
## 2 Bulgaria    Europe & Central Asia     1990     2011             18.4          10.4
## 3 Uzbekistan  Europe & Central Asia     1990     2008             72.3          38.5
## 4 Moldova     Europe & Central Asia     1990     2011             33.2          16.8
## 5 Switzerland Europe & Central Asia     1985     2011              9             4.5
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
  filter(region == region_list[2])%>%
  slice_max(order_by = -evolution_of_mortality,n = 5)
## # A tibble: 5 × 7
##   country    region                min_year max_year mortality_origin mortality_end
##   <chr>      <chr>                    <dbl>    <dbl>            <dbl>         <dbl>
## 1 Portugal   Europe & Central Asia     1979     2011             29.7           3.7
## 2 Greece     Europe & Central Asia     1979     2011             21             3.9
## 3 Serbia     Europe & Central Asia     1985     2011             39.1           7.4
## 4 Luxembourg Europe & Central Asia     1979     2011             13.9           3  
## 5 Italy      Europe & Central Asia     1979     2011             17             3.9
## # … with 1 more variable: evolution_of_mortality <dbl>

Latin America & Caribbean

minmax_data %>%
  filter(region == region_list[3])%>%
  slice_max(order_by = evolution_of_mortality,n = 5)
## # A tibble: 5 × 7
##   country             region    min_year max_year mortality_origin mortality_end
##   <chr>               <chr>        <dbl>    <dbl>            <dbl>         <dbl>
## 1 Costa Rica          Latin Am…     1990     2011             16.9          10.2
## 2 Trinidad and Tobago Latin Am…     1982     2011             36.9          22.2
## 3 Barbados            Latin Am…     1979     2011             28.6          14.9
## 4 Guyana              Latin Am…     1979     2011             70.5          36.5
## 5 Belize              Latin Am…     1990     2011             38.3          18.3
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
  filter(region == region_list[3])%>%
  slice_max(order_by = -evolution_of_mortality,n = 5)
## # A tibble: 5 × 7
##   country     region            min_year max_year mortality_origin mortality_end
##   <chr>       <chr>                <dbl>    <dbl>            <dbl>         <dbl>
## 1 Peru        Latin America & …     1979     2011            129            18.9
## 2 Brazil      Latin America & …     1979     2011            100.           17.9
## 3 Ecuador     Latin America & …     1979     2011             96.1          17.5
## 4 El Salvador Latin America & …     1985     2011             79.8          18.3
## 5 Uruguay     Latin America & …     1979     2011             44            10.3
## # … with 1 more variable: evolution_of_mortality <dbl>

Middle East & North Africa

minmax_data %>%
  filter(region == region_list[4])%>%
  slice_max(order_by = evolution_of_mortality,n = 5)
## # A tibble: 5 × 7
##   country  region                     min_year max_year mortality_origin mortality_end
##   <chr>    <chr>                         <dbl>    <dbl>            <dbl>         <dbl>
## 1 Malta    Middle East & North Africa     1990     2011             11.4           6.8
## 2 Algeria  Middle East & North Africa     1990     2008             49.5          29.5
## 3 Djibouti Middle East & North Africa     1983     2011            142.           74.1
## 4 Qatar    Middle East & North Africa     1990     2008             20.8           9.6
## 5 Morocco  Middle East & North Africa     1990     2011             79.1          30.4
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
  filter(region == region_list[4])%>%
  slice_max(order_by = -evolution_of_mortality,n = 5)
## # A tibble: 5 × 7
##   country region                     min_year max_year mortality_origin mortality_end
##   <chr>   <chr>                         <dbl>    <dbl>            <dbl>         <dbl>
## 1 Lebanon Middle East & North Africa     1979     2011             51.5           9.8
## 2 Israel  Middle East & North Africa     1979     2011             19.1           4.4
## 3 Oman    Middle East & North Africa     1990     2008             38.7          12  
## 4 Tunisia Middle East & North Africa     1990     2011             55.3          18  
## 5 Morocco Middle East & North Africa     1990     2011             79.1          30.4
## # … with 1 more variable: evolution_of_mortality <dbl>

North America

minmax_data %>%
  filter(region == region_list[5])%>%
  slice_max(order_by = evolution_of_mortality,n = 5)
## # A tibble: 2 × 7
##   country       region        min_year max_year mortality_origin mortality_end
##   <chr>         <chr>            <dbl>    <dbl>            <dbl>         <dbl>
## 1 United States North America     1979     2011             15.6           7.2
## 2 Canada        North America     1979     2011             13.3           5.6
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
  filter(region == region_list[5])%>%
  slice_max(order_by = -evolution_of_mortality,n = 5)
## # A tibble: 2 × 7
##   country       region        min_year max_year mortality_origin mortality_end
##   <chr>         <chr>            <dbl>    <dbl>            <dbl>         <dbl>
## 1 Canada        North America     1979     2011             13.3           5.6
## 2 United States North America     1979     2011             15.6           7.2
## # … with 1 more variable: evolution_of_mortality <dbl>

South Asia

minmax_data %>%
  filter(region == region_list[6])%>%
  slice_max(order_by = evolution_of_mortality,n = 5)
## # A tibble: 5 × 7
##   country     region     min_year max_year mortality_origin mortality_end
##   <chr>       <chr>         <dbl>    <dbl>            <dbl>         <dbl>
## 1 Afghanistan South Asia     2009     2011             91.4          83.9
## 2 Pakistan    South Asia     1990     2011            140.           85  
## 3 Sri Lanka   South Asia     1990     2011             21.9          11.2
## 4 India       South Asia     1990     2009            126.           61.4
## 5 Bangladesh  South Asia     1990     2011            144.           46.1
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
  filter(region == region_list[6])%>%
  slice_max(order_by = -evolution_of_mortality,n = 5)
## # A tibble: 5 × 7
##   country    region     min_year max_year mortality_origin mortality_end
##   <chr>      <chr>         <dbl>    <dbl>            <dbl>         <dbl>
## 1 Maldives   South Asia     1990     2011             85.8          13  
## 2 Nepal      South Asia     1985     2011            175.           44.2
## 3 Bhutan     South Asia     1990     2011            127            39.9
## 4 Bangladesh South Asia     1990     2011            144.           46.1
## 5 India      South Asia     1990     2009            126.           61.4
## # … with 1 more variable: evolution_of_mortality <dbl>

Sub-Saharan Africa

minmax_data %>%
  filter(region == region_list[7])%>%
  slice_max(order_by = evolution_of_mortality,n = 5)
## # A tibble: 5 × 7
##   country               region  min_year max_year mortality_origin mortality_end
##   <chr>                 <chr>      <dbl>    <dbl>            <dbl>         <dbl>
## 1 South Sudan           Sub-Sa…     2011     2011            102.          102. 
## 2 Lesotho               Sub-Sa…     1984     2011            101.           96.1
## 3 Ethiopia              Sub-Sa…     2009     2011             86.7          77.6
## 4 Sao Tome and Principe Sub-Sa…     2009     2011             47.9          42.4
## 5 Somalia               Sub-Sa…     1987     2011            186           151. 
## # … with 1 more variable: evolution_of_mortality <dbl>
minmax_data %>%
  filter(region == region_list[7])%>%
  slice_max(order_by = -evolution_of_mortality,n = 5)
## # A tibble: 5 × 7
##   country region             min_year max_year mortality_origin mortality_end
##   <chr>   <chr>                 <dbl>    <dbl>            <dbl>         <dbl>
## 1 Rwanda  Sub-Saharan Africa     1979     2011             233.          57.1
## 2 Eritrea Sub-Saharan Africa     1985     2011             178.          53.2
## 3 Malawi  Sub-Saharan Africa     1980     2011             256.          78.6
## 4 Senegal Sub-Saharan Africa     1981     2011             198.          63  
## 5 Uganda  Sub-Saharan Africa     1979     2011             216.          72.5
## # … with 1 more variable: evolution_of_mortality <dbl>

How has the under-5 mortality rate changed by region?

evolution_change <- by_region %>%
  mutate(min_year = as.numeric(min_year),
         max_year = as.numeric(max_year),
         mortality_origin = as.numeric(mortality_origin),
         mortality_end = as.numeric(mortality_end),
         evolution_of_mortality = (mortality_end - mortality_origin) / mortality_origin) %>%
  arrange(desc(evolution_of_mortality))

evolution_change
## # A tibble: 139 × 7
##    country                  region min_year max_year mortality_origin mortality_end
##    <chr>                    <chr>     <dbl>    <dbl>            <dbl>         <dbl>
##  1 China                    East …     2011     2011             14.6          14.6
##  2 South Sudan              Sub-S…     2011     2011            102.          102. 
##  3 Lesotho                  Sub-S…     1984     2011            101.           96.1
##  4 Afghanistan              South…     2009     2011             91.4          83.9
##  5 Ethiopia                 Sub-S…     2009     2011             86.7          77.6
##  6 Sao Tome and Principe    Sub-S…     2009     2011             47.9          42.4
##  7 Fiji                     East …     1990     2011             28.9          23.6
##  8 Somalia                  Sub-S…     1987     2011            186           151. 
##  9 Mauritania               Sub-S…     1990     2011            118.           93.7
## 10 Central African Republic Sub-S…     1979     2011            188.          143. 
## # … with 129 more rows, and 1 more variable: evolution_of_mortality <dbl>
evolution_by_region <- evolution_change %>%
  group_by(region) %>%
  summarise(mean_mortality_evol = mean(evolution_of_mortality))

evolution_by_region
## # A tibble: 7 × 2
##   region                     mean_mortality_evol
##   <chr>                                    <dbl>
## 1 East Asia & Pacific                     -0.514
## 2 Europe & Central Asia                   -0.651
## 3 Latin America & Caribbean               -0.633
## 4 Middle East & North Africa              -0.598
## 5 North America                           -0.559
## 6 South Asia                              -0.555
## 7 Sub-Saharan Africa                      -0.455
evolution_by_region %>%
  ggplot(aes(y = reorder(region, -mean_mortality_evol), x = 100*mean_mortality_evol))+
  geom_col()+
  labs(
    title= "Change in mortality rate by region",
    x= "Change in mortality rate (%)",
    y=""
  )

The biggest change is in Europe & Central Asia, where the under-5 mortality rate fell by more than 60% on average; the smallest improvement is in Sub-Saharan Africa.

  1. Is there a relationship between primary school enrollment and fertility rate?
table_gapminder %>%
  ggplot(aes(x = SE.PRM.NENR, y = SP.DYN.TFRT.IN))+
  geom_point()+
  facet_wrap(~region)+
  geom_smooth(method = "lm")+
  labs(
    title="Relationship between fertility rate and primary school enrollment",
    y="Fertility rate",
    x= "Primary school enrollment, (%) population"
  )

In every region except Europe & Central Asia, higher primary school enrollment is associated with a lower fertility rate.
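`geom_smooth(method = "lm")` overlays a per-facet least-squares fit; the sign of the relationship can also be checked directly with `lm()`. A minimal sketch on made-up numbers (not the gapminder data):

```r
# toy data: enrollment (%) vs fertility rate, chosen to mimic the observed pattern
toy <- data.frame(enrollment = c(60, 70, 80, 90, 100),
                  fertility  = c(5.1, 4.4, 3.2, 2.6, 1.9))

fit <- lm(fertility ~ enrollment, data = toy)
coef(fit)[["enrollment"]]  # negative slope: higher enrollment, lower fertility
```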

5 Challenge 1: Excess rentals in TfL bike sharing

url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"

# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-08-23T14%3A32%3A29/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20210911%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20210911T133924Z&X-Amz-Expires=300&X-Amz-Signature=e887fc183fd2074150a05ea30ae23e430d8c0026591c85b86213f78ef9d4f1e1&X-Amz-SignedHeaders=host]
##   Date: 2021-09-11 13:39
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 173 kB
## <ON DISK>  /var/folders/91/dm_fdgkx0l56y1cm7zy73wj40000gn/T//RtmpokIBPw/file86f70e20191.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
                   sheet = "Data",
                   range = cell_cols("A:B"))

# change dates to get year, month, and week
bike <- bike0 %>% 
  clean_names() %>% 
  rename(bikes_hired = number_of_bicycle_hires) %>% 
  mutate(year = year(day),
         month = lubridate::month(day, label = TRUE),
         week = isoweek(day))
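The `year()`, `month()`, and `isoweek()` helpers all come from lubridate; a quick check on a single date shows what each extracts (note that ISO weeks start on Monday, so `isoweek()` can disagree with `week()` near year boundaries):

```r
library(lubridate)

d <- ymd("2020-03-30")   # a Monday
year(d)                  # 2020
month(d, label = TRUE)   # Mar
isoweek(d)               # 14
```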

It’s interesting to note that during the 2020 lockdown, rentals decreased by half in comparison to other years.

The second plot looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).

We used the mean to calculate expected rentals, because when we tested the median it deviated considerably from the actual numbers.
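The expected-rentals baseline can be sketched as a per-week mean over the reference years, with the excess defined as actual minus expected (toy numbers below, not the TfL data):

```r
library(dplyr)

rentals <- tibble(week = c(1, 1, 2, 2),
                  bikes_hired = c(20000, 24000, 18000, 30000))

# expected level: mean rentals for each week across the reference period
expected <- rentals %>%
  group_by(week) %>%
  summarise(expected_rentals = mean(bikes_hired))

expected  # week 1: 22000, week 2: 24000
```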

6 Details

  • Who did you collaborate with: Our study group
  • Approximately how much time did you spend on this problem set: 3 days
  • What, if anything, gave you the most trouble: Bootstrapping

7 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.